Synthesized Data
SynGraph: A Dynamic Graph-LLM Synthesis Framework for Sparse Streaming User Sentiment Modeling
Zhang, Xin, Wei, Qiyu, Zhu, Yingjie, Zhang, Linhai, Zhou, Deyu, Ananiadou, Sophia
User reviews on e-commerce platforms exhibit dynamic sentiment patterns driven by temporal and contextual factors. Traditional sentiment analysis methods focus on static reviews and fail to capture the evolving temporal relationship between user sentiment ratings and textual content. Sentiment analysis on streaming reviews addresses this limitation by modeling and predicting the temporal evolution of user sentiment. However, it suffers from data sparsity, which manifests in temporal, spatial, and combined forms. In this paper, we introduce SynGraph, a novel framework designed to address data sparsity in sentiment analysis on streaming reviews. SynGraph alleviates sparsity by categorizing users into mid-tail, long-tail, and extreme scenarios and incorporating LLM-based augmentation within a dynamic graph structure. Experiments on real-world datasets demonstrate its effectiveness in addressing sparsity and improving sentiment modeling on streaming reviews.
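To make the sparsity categorization concrete, below is a minimal Python sketch of the bucketing step the abstract describes, partitioning users by review count before any graph construction or LLM augmentation. The thresholds and field names are assumptions for illustration, not SynGraph's actual cut-offs.

```python
from collections import Counter

def bucket_users(reviews, mid_tail_min=20, long_tail_min=5):
    """Assign each user to a sparsity scenario by review count.

    The thresholds (20 / 5) and the "user_id" field are illustrative
    placeholders, not the paper's actual cut-offs or schema.
    """
    counts = Counter(r["user_id"] for r in reviews)
    buckets = {"mid_tail": [], "long_tail": [], "extreme": []}
    for user, n in counts.items():
        if n >= mid_tail_min:
            buckets["mid_tail"].append(user)
        elif n >= long_tail_min:
            buckets["long_tail"].append(user)
        else:  # near-cold-start users with very few reviews
            buckets["extreme"].append(user)
    return buckets
```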
Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments
Su, Hongjin, Sun, Ruoxi, Yoon, Jinsung, Yin, Pengcheng, Yu, Tao, Arık, Sercan Ö.
Autonomous agents powered by large language models (LLMs) have the potential to enhance human capabilities, assisting with digital tasks from sending emails to performing data analysis. The performance of existing LLMs on such tasks is often hindered by the lack of high-quality agent data from the corresponding environments they interact with. We propose Learn-by-interact, a data-centric framework that adapts LLM agents to any given environment without human annotations. Learn-by-interact synthesizes trajectories of agent-environment interactions based on documentation, and constructs instructions by summarizing or abstracting the interaction histories, a process called backward construction. We assess the quality of our synthetic data by using it in both training-based scenarios and training-free in-context learning (ICL), where we craft innovative retrieval approaches optimized for agents. Extensive experiments on SWE-bench, WebArena, OSWorld, and Spider2-V, spanning realistic coding, web, and desktop environments, show the effectiveness of Learn-by-interact in various downstream agentic tasks -- baseline results are improved by up to 12.2\% for ICL with Claude-3.5 and 19.5\% for training with Codestral-22B. We further demonstrate the critical role of backward construction, which provides up to 14.0\% improvement for training. Our ablation studies demonstrate the efficiency provided by our synthesized data in ICL and the superiority of our retrieval pipeline over alternative approaches such as conventional retrieval-augmented generation (RAG). We expect that Learn-by-interact will serve as a foundation for agent data synthesis as LLMs are increasingly deployed in real-world environments.
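As a rough illustration of backward construction, the sketch below rolls an agent through an environment and then derives the instruction from the realized trajectory rather than from the seed task. All of `env`, `policy_llm`, and `summarizer_llm` are hypothetical interfaces, not the paper's API.

```python
def backward_construction(env, policy_llm, summarizer_llm, seed_task):
    """Roll out an agent in `env`, then derive the instruction from the
    trajectory itself ("backward construction"), rather than trusting
    the seed task to match what actually happened."""
    trajectory = []
    obs = env.reset(seed_task)        # hypothetical environment interface
    done = False
    while not done:
        action = policy_llm(obs, trajectory)   # agent proposes next step
        obs, done = env.step(action)
        trajectory.append((action, obs))
    # Summarize/abstract the realized interaction history into an
    # instruction that the trajectory verifiably accomplishes.
    instruction = summarizer_llm(
        "Write the task instruction this trajectory completes:\n"
        + "\n".join(str(step) for step in trajectory)
    )
    return {"instruction": instruction, "trajectory": trajectory}
```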
SemiDFL: A Semi-Supervised Paradigm for Decentralized Federated Learning
Liu, Xinyang, Han, Pengchao, Li, Xuan, Liu, Bo
Decentralized federated learning (DFL) realizes cooperative model training among connected clients without relying on a central server, thereby mitigating communication bottlenecks and eliminating the single-point-of-failure issue present in centralized federated learning (CFL). Most existing work on DFL focuses on supervised learning, assuming each client possesses sufficient labeled data for local training. However, in real-world applications, much of the data is unlabeled. We address this by considering a challenging yet practical semi-supervised learning (SSL) scenario in DFL, where clients may have varying data sources: some with few labeled samples, some with purely unlabeled data, and others with both. In this work, we propose SemiDFL, the first semi-supervised DFL method, which enhances DFL performance in SSL scenarios by establishing consensus in both the data and model spaces. Specifically, we utilize neighborhood information to improve the quality of pseudo-labeling, which is crucial for effectively leveraging unlabeled data. We then design a consensus-based diffusion model to generate synthesized data, which is used in combination with pseudo-labeled data to create mixed datasets. Additionally, we develop an adaptive aggregation method that leverages the model accuracy on synthesized data to further enhance SemiDFL performance. Through extensive experimentation, we demonstrate the remarkable performance superiority of the proposed SemiDFL method over existing CFL and DFL schemes in both IID and non-IID SSL scenarios.
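A minimal sketch of the neighborhood-based pseudo-labeling idea: a client averages the class probabilities predicted by its own model and its graph neighbors' models, keeping only confident predictions. The `predict_proba` interface and the 0.9 confidence threshold are assumptions, not SemiDFL's exact procedure.

```python
import numpy as np

def consensus_pseudo_labels(models, neighbor_ids, x_unlabeled, threshold=0.9):
    """Pseudo-label a client's unlabeled batch by averaging the softmax
    outputs of the client's own model and its neighbors' models, and
    keep only confident predictions. The threshold is an assumption."""
    # Mean class-probability over the neighborhood (including self).
    probs = np.mean([models[i].predict_proba(x_unlabeled)
                     for i in neighbor_ids], axis=0)
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = conf >= threshold
    return x_unlabeled[keep], labels[keep]
```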
FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering
Abaskohi, Amirhossein, Gella, Spandana, Carenini, Giuseppe, Laradji, Issam H.
Multimodal multihop question answering is a complex task that requires reasoning over multiple sources of information, such as images and text, to answer questions. While there has been significant progress in visual question answering, the multihop setting remains unexplored due to the lack of high-quality datasets. Current methods focus on single-hop question answering or a single modality, which makes them unsuitable for real-world scenarios such as analyzing multimodal educational materials, summarizing lengthy academic articles, or interpreting scientific studies that combine charts, images, and text. To address this gap, we propose a novel methodology, introducing the first framework for creating a high-quality dataset that enables training models for multimodal multihop question answering. Our approach consists of a five-stage pipeline that involves acquiring relevant multimodal documents from Wikipedia, synthetically generating high-level questions and answers, and validating them against rigorous criteria to ensure data quality. We evaluate our methodology by training models on our synthesized dataset and testing on two benchmarks; our results demonstrate that, with an equal sample size, models trained on our synthesized data outperform those trained on human-collected data by 1.9 points in exact match (EM) on average. We believe our data synthesis method will serve as a strong foundation for training and evaluating multimodal multihop question answering models.
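Since the headline result is reported in exact match, here is the standard SQuAD-style EM metric for reference; the paper may use a variant, so treat this as the conventional definition rather than its exact evaluation code.

```python
import re
import string

def normalize(text):
    """Standard SQuAD-style answer normalization: lowercase, strip
    punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """EM = 1 if the normalized prediction equals any normalized gold answer."""
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))
```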
EAPCR: A Universal Feature Extractor for Scientific Data without Explicit Feature Relation Patterns
Yu, Zhuohang, An, Ling, Li, Yansong, Wu, Yu, Dong, Zeyu, Liu, Zhangdi, Gao, Le, Zhang, Zhenyu, Zhou, Chichun
Conventional methods, including Decision Tree (DT)-based methods, have been effective in scientific tasks such as non-image medical diagnostics, system anomaly detection, and inorganic catalysis efficiency prediction. However, most deep-learning techniques have struggled to surpass, or even match, the success of these traditional machine-learning methods. The primary reason is that these applications involve multi-source, heterogeneous data whose features lack explicit relationships. This contrasts with image data, where pixels exhibit spatial relationships; textual data, where words have sequential dependencies; and graph data, where nodes are connected through established associations. The absence of explicit Feature Relation Patterns (FRPs) presents a significant challenge for deep learning in scientific applications that are not image-, text-, or graph-based. In this paper, we introduce EAPCR, a universal feature extractor designed for data without explicit FRPs. Tested across various scientific tasks, EAPCR consistently outperforms traditional methods and bridges the gap where deep-learning models fall short. To further demonstrate its robustness, we synthesize a dataset without explicit FRPs. While the Kolmogorov-Arnold Network (KAN) and feature extractors such as Convolutional Neural Networks (CNNs), Graph Convolutional Networks (GCNs), and Transformers struggle on it, EAPCR excels, demonstrating its robustness and superior performance in scientific tasks without FRPs.
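The sketch below shows one plausible way to build such an FRP-free synthetic dataset: labels depend on fixed but arbitrary feature interactions, and column order carries no spatial or sequential meaning. This is a hypothetical stand-in for the paper's construction, which the abstract does not specify.

```python
import numpy as np

def make_frp_free_dataset(n=10_000, d=16, seed=0):
    """Illustrative synthetic data with no explicit feature-relation
    pattern: the label depends on arbitrary hidden feature pairs, and
    the column order is shuffled so position conveys nothing."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    i, j, k = rng.choice(d, size=3, replace=False)  # hidden interactions
    y = (X[:, i] * X[:, j] + np.sin(X[:, k]) > 0).astype(int)
    X = X[:, rng.permutation(d)]  # destroy any positional pattern
    return X, y
```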
Privacy-Preserving SAM Quantization for Efficient Edge Intelligence in Healthcare
Li, Zhikai, Zhang, Jing, Gu, Qingyi
The disparity in healthcare personnel expertise and medical resources across different regions of the world is a pressing social issue. Artificial intelligence technology offers new opportunities to alleviate this problem. The Segment Anything Model (SAM), which excels at intelligent image segmentation, has demonstrated exceptional performance in medical monitoring and assisted diagnosis. Unfortunately, the huge computational and storage overhead of SAM poses significant challenges for deployment on resource-limited edge devices. Quantization is an effective solution for model compression; however, traditional methods rely heavily on original data for calibration, which raises widespread concerns about medical data privacy and security. In this paper, we propose a data-free quantization framework for SAM, called DFQ-SAM, which learns and calibrates quantization parameters without any original data, thus effectively preserving data privacy during model compression. Specifically, we propose pseudo-positive label evolution for segmentation, combined with patch similarity, to fully leverage the semantic and distribution priors in pre-trained models, which facilitates high-quality data synthesis as a substitute for real data. Furthermore, we introduce scale reparameterization to preserve accuracy under low-bit quantization. We perform extensive segmentation experiments on various datasets, and DFQ-SAM consistently delivers strong performance under low-bit quantization. DFQ-SAM eliminates the need for data transfer in cloud-edge collaboration, thereby protecting sensitive data from potential attacks. It enables secure, fast, and personalized healthcare services at the edge, enhancing system efficiency, optimizing resource allocation, and facilitating the pervasive application of artificial intelligence in worldwide healthcare.
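For orientation, the sketch below shows generic data-free calibration of a symmetric per-tensor quantization scale from activations collected on synthesized batches. It illustrates the calibration step in principle only; DFQ-SAM's pseudo-positive label evolution and scale reparameterization are more involved than this.

```python
import torch

def calibrate_scale(layer_outputs: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Per-tensor symmetric quantization scale from activations gathered
    on synthesized (not real) calibration data. A generic sketch of the
    calibration step, not DFQ-SAM's actual scale reparameterization."""
    qmax = 2 ** (n_bits - 1) - 1
    return layer_outputs.abs().max() / qmax

def fake_quant(x: torch.Tensor, scale: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Quantize-dequantize round trip for simulating low-bit inference."""
    qmax = 2 ** (n_bits - 1) - 1
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale
```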
Training Task Experts through Retrieval Based Distillation
Ge, Jiaxin, Jia, Xueying, Viswanathan, Vijay, Luo, Hongyin, Neubig, Graham
One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality task-specific data. However, for specialized tasks, such datasets often do not exist. Existing methods address this by generating such data with large language models (LLMs) and then distilling the knowledge into smaller models. However, these methods are limited by the quality of the LLMs' outputs and tend to generate repetitive or incorrect data. In this work, we present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms it into domain-specific data. This method greatly enhances data diversity. Moreover, ReBase generates Chain-of-Thought reasoning and distills the reasoning capacity of LLMs. We test our method on four benchmarks; the results show that our method significantly improves performance, by up to 7.8% on SQuAD, 1.37% on MNLI, and 1.94% on BigBench-Hard.
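A minimal sketch of the retrieve-then-transform loop the abstract describes: pull candidate records from an online source, then prompt an LLM to rewrite each into a task-formatted example with chain-of-thought reasoning. Both `retriever` and `llm` are hypothetical callables, not ReBase's actual components.

```python
def rebase_style_synthesis(task_description, retriever, llm, k=50):
    """Retrieve candidate records from rich online sources, then have
    an LLM transform each into a task-specific training example that
    includes chain-of-thought reasoning."""
    examples = []
    for record in retriever.search(task_description, top_k=k):
        out = llm(
            f"Task: {task_description}\n"
            f"Source record: {record}\n"
            "Rewrite this record as an (input, reasoning, answer) "
            "example for the task. Think step by step."
        )
        examples.append(out)
    return examples
```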
Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement
Feng, Yunzhen, Dohmatob, Elvis, Yang, Pu, Charton, Francois, Kempe, Julia
Synthesized data from generative models is increasingly considered an alternative to human-annotated data for fine-tuning large language models. This raises concerns about model collapse: a drop in performance of models fine-tuned on generated data. Considering that it is easier for both humans and machines to distinguish good examples from bad ones than to generate high-quality samples, we investigate the use of feedback on synthesized data to prevent model collapse. We derive theoretical conditions under which a Gaussian mixture classification model can achieve asymptotically optimal performance when trained on feedback-augmented synthesized data, and provide supporting simulations for finite regimes. We illustrate our theoretical predictions on two practical problems: computing matrix eigenvalues with transformers and news summarization with large language models, both of which undergo model collapse when trained on model-generated data. We show that training on feedback-augmented synthesized data, either by pruning incorrect predictions or by selecting the best of several guesses, can prevent model collapse, validating popular approaches like RLHF.
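The two feedback strategies the abstract names, pruning incorrect predictions and best-of-n selection, can be sketched in a few lines; `generator` and `verifier` are hypothetical callables standing in for the data-generating model and the feedback signal.

```python
def feedback_filtered_data(prompts, generator, verifier, n_samples=8):
    """Build a synthetic training set that keeps only generations which
    pass a feedback signal: sample several candidates per prompt, take
    the verifier's best one, and prune prompts where nothing passes."""
    dataset = []
    for p in prompts:
        candidates = [generator(p) for _ in range(n_samples)]
        scored = [(verifier(p, c), c) for c in candidates]
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score > 0:            # prune when no candidate is acceptable
            dataset.append((p, best))
    return dataset
```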
A Tale of Tails: Model Collapse as a Change of Scaling Laws
Dohmatob, Elvis, Feng, Yunzhen, Yang, Pu, Charton, Francois, Kempe, Julia
As AI model size grows, neural scaling laws have become a crucial tool to predict the improvements of large models when increasing capacity and the size of the original (human or natural) training data. Yet, the widespread use of popular models means that the ecosystem of online data and text will co-evolve to progressively contain increasing amounts of synthesized data. In this paper we ask: how will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus? Will future models still improve, or will they be doomed to degenerate, up to total (model) collapse? We develop a theoretical framework of model collapse through the lens of scaling laws. We discover a wide range of decay phenomena, analyzing loss of scaling, scaling shifted with the number of generations, the "un-learning" of skills, and grokking when mixing human and synthesized data. Our theory is validated by large-scale experiments with a transformer on an arithmetic task and text generation using the large language model Llama2.
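For intuition, the clean-data scaling law the abstract builds on, together with one illustrative (assumed, not the paper's exact) form of how collapse can appear as a generation-dependent shift, can be written as:

```latex
% Standard neural scaling law on clean data: test loss decays as a
% power law in training-set size T (A, \alpha, and the irreducible
% error E are fitted constants).
L(T) \approx \frac{A}{T^{\alpha}} + E

% Illustrative collapsed regime (an assumption for intuition, not the
% paper's exact functional form): after n generations of training on
% model-generated data, the law acquires a generation-dependent
% penalty \Delta(n), so adding data eventually stops reducing loss.
L_n(T) \approx \frac{A}{T^{\alpha}} + \Delta(n) + E
```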